A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations
نویسندگان
چکیده
Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems it is therefore considered critical that strategies are developed to make software resilient against failures. In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. We demonstrate the efficiency and scalability of the checkpoint strategy for simulations with up to 40 billion computational cells executing on more than 400 billion floating point values. A checkpoint creation is shown to require only a few seconds and the new checkpointing scheme scales almost perfectly up to more than 260 000 (2) processes. To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI. The checkpointing mechanism is fully integrated in a state-of-the-art high-performance multi-physics simulation framework. We demonstrate the efficiency and robustness of the method with a realistic phase-field simulation originating in the material sciences and with a lattice Boltzmann method implementation. ∗[email protected] †[email protected] 1 ar X iv :1 70 8. 08 28 6v 2 [ cs .D C ] 2 9 Ja n 20 18
منابع مشابه
An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment
Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...
متن کاملTowards Scalable Data Management for Map-Reduce-based Data-Intensive Applications on Cloud and Hybrid Infrastructures
Data: • Massive, unstructured data objects (Terabytes) • Many data objects (10³-10) ⁶ • High concurrency (10³ concurrent clients) • Fine-grain access (Megabytes) Applications: • Map-Reduce-based data-analysis applications • Governmental and commercial statistics • Data-intensive HPC simulations • Checkpointing for massively parallel computations Platforms:
متن کاملA Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications
As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require a central storage for storing checkpoints. This severely limits the scalability of checkpointing. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend...
متن کاملHigh speed Radix-4 Booth scheme in CNTFET technology for high performance parallel multipliers
A novel and robust scheme for radix-4 Booth scheme implemented in Carbon Nanotube Field-Effect Transistor (CNTFET) technology has been presented in this paper. The main advantage of the proposed scheme is its improved speed performance compared with previous designs. With the help of modifications applied to the encoder section using Pass Transistor Logic (PTL), the corresponding capacitances o...
متن کاملEecient, Language-based Checkpointing for Massively Parallel Programs
Checkpointing and restart is an approach to ensuring forward progress of a program in spite of system failures or planned interruptions. We investigate issues in checkpointing and restart of programs running on massively parallel computers. We identify a new set of issues that have to be considered for the MPP platform, based on which we have designed an approach based on the language and run-t...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1708.08286 شماره
صفحات -
تاریخ انتشار 2017